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Abstract 



Covariance graphical lasso applies a lasso penalty on the elements of the co- 
variance matrix. This method is useful because it not only produces sparse esti- 
mation of covariance matrix but also discovers marginal independence structures 
by generating zeros in the covariance matrix. We propose and explore two new al- 
gorithms for solving the covariance graphical lasso problem. Our new algorithms 
are based on coordinate descent and ECM. We show that these two algorithms are 
more attractive than the only existing competing algorithm of Bien & Tibshirani 
(2011) in terms of simplicity speed and stability. We also discuss convergence 
properties of our algorithms. 

Key words: Coordinate descent; Covariance graphical lasso; Covariance matrix estimation; 
L\ penalty; ECM algorithm; Marginal independence; Regularization; Shrinkage; Sparsity 



Bien & Tibshirani (2011) proposed a covariance graphical lasso procedure for simul- 
taneously estimating covariance matrix and marginal dependence structures. Let S 
be the sample covariance matrix such that S = Y'Y/n where Y(n x p) is the data 
matrix of p variables and n samples. A basic version of their covariance graphical 
lasso problem is to minimize the following objective function: 



over the space of positive definite matrices M + with p > being the shrinkage pa- 
rameter. Here, £ = (ay) is the p x p covariance matrix and ||£||i = ^2 1<ij<p \o~ij\ 
is the Li-norm of S. A general version of the covariance graphical lasso in Bien & 
Tibshirani (2011) allows different shrinkage parameters for different elements in S. 
To ease exposition, we describe our methods in the context of the simple case of com- 
mon shrinkage parameter in (1). All of our results can be extended to the general 
version of different shrinkage parameters with little difficulty. 

Because of the Li-norm term, the covariance graphical lasso is possible to set some 
of the off-diagonal elements of £ exactly equal to zero in its minimum point of (1). 



1 Introduction 




(1) 



Zeros in £ encode marginal independence structures among the components of a 
multivariate normal random vector with covariance matrix S. It is distinctly different 
from the concentration graphical models (also referred to as covariance selection 
models due to Dempster 1972) where zeros are in the concentration matrix S 1 and 
are associated with conditional independence. 

The objective function (1) is not convex, imposing computational challenges for 
minimizing it. Bien & Tibshirani (2011) proposed a complicated majorize-minimize 
approach that iteratively solves convex approximation to the original nonconvex prob- 
lem. In the current paper, we develop two alternative algorithms for minimizing 
(1): the coordinate descent algorithm and the Expectation/Conditional Maximiza- 
tion (ECM) algorithm. We discuss their convergence properties and investigate their 
computational efficiency through simulation studies. In comparison with Bien & Tib- 
shirani (2011)'s algorithm, our new algorithms are much simpler to implement, sub- 
stantially faster to run and numerically more stable in our tests. 

Neither the coordinate descent algorithm nor the ECM algorithm is new to the 
model fitting for regularized problems. The coordinate descent algorithm is shown to 
be very competitive for fitting convex and even some non-convex penalized regression 
models (Friedman et al., 2007; Wu & Lange, 2008; Breheny & Huang, 2011) as well as 
the concentration graphical models (Friedman et al., 2008) . The ECM algorithm (or 
its variants) has been developed in the Bayesian framework for finding the maximum 
a posteriori (MAP) estimation of regularized linear regression models (Poison & Scott, 
2011; Armagan et al., 2011). However, our application of these two algorithms to 
covariance graphical lasso models is new and unexplored before. In this sense, our 
work documents that the coordinate descent algorithm and the ECM algorithm are 
also quite powerful in solving covariance graphical lasso models. 



Our first algorithm to minimize the objective function (1) uses the simple idea of 
coordinate descent methods. We show how to update £ one column and row at a 
time while holding all of the rest elements in £ fixed. Without loss of generality, we 
focus on the last column and row. Partition £ and S as follows: 



where (a) S n and Sn are the covariance matrix and the sample covariance matrix 
of the first p — 1 variables, respectively; (b) <r 12 and sn are the covariances and the 
sample covariances between the first p— 1 variables and the last variable, respectively; 
and (c) cr 2 2 and s ss are the variance and the sample variance of the last variable, 
respectively. 



2 TWO PROPOSED ALGORITHMS 



2.1 Coordinate descent algorithm 




(2) 



2 



Let 

(3 = C12, 7 = (T22 — cr' 12 Y:^(Ti2, 
and apply the block matrix inversion to £ using blocks (En, (3, 7): 

S "I -/3'SnS- 1 7- 1 J' (3) 

After removing some constants not involving 7), the three terms in (1) can be 
expressed as a function of ((3, 7) : 

log(detS) = log( 7 ), 

trtSE- 1 ) = ^S^SnS^- 1 - 2s , 12 £ n 1 /3 7 - 1 + s 22l ~\ 

HiSHx = 2p||/3|| 1 + p(/3 / S n 1 /3 + 7), (4) 

which lead to the following objective function with respect to (/3, 7) : 

miJlog^H/^SiiSn 1 ^ (5) 
For 7, removing terms in (5) that do not depend on 7 gives 

min|log(7) + cry -1 +pjj, 

where a = (3 / T,^Sii'E^(3 — 2s 12 S n 1 /3 + s 2 2- Clearly, it is solved as follows: 

A _ / a if P = 0, , . 

7 1 (-l + v / T+4^)/(2p) ifp^O. w 

For /3, removing terms in (5) that do not depend on (3 gives 

rmn j/3'V/3 - 2u'/3 + 2^/311! j, (7) 

where V = (vij) = S^SnS^^ -1 + pSJi, u = s^S^iS -1 . The problem in (7) is 
a lasso problem and can be efficiently solved by fast coordinate descent algorithms 
(Friedman et al., 2007; Wu & Lange, 2008). Specifically for j E {1, . . . ,p - 1}, the 
minimum point of (7) along the coordinate direction in which varies is: 

0j = S(uj - y^ y v k j/3k,p)/vjj, (8) 

where S is the soft-threshold operator: 

S(x, t) = sign(x)(|x| — 1) + . 
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The update (8) is iterated for j = 1, ... ,p — 1,1,2, ... , until convergence. We then 
update the column as (cr 12 = (3, cr 2 2 = 7 + followed by cycling through all 

columns until convergence. This algorithm can been viewed as a block coordinate 
descent method with p blocks of (3s and another p blocks of 7s. The algorithm is 
summarized as follows: 

Coordinate descent algorithm Given input (S,p), start with E^ 0) , at the (k + l)th 
iteration (k = 0, 1, . . .) 

1. Let S (fc+1) = S (fc) . 

2. For i = 1, . . . ,p, 

(a) Partition S (fc+1) and S as in (2). 

(b) Compute 7 as in (6). 

(c) Solve the lasso problem (7) by repeating (8) until convergence. 

(d) Update <r% +1) = /?, = /3>, ag +1) = 7 + 

3. Let k — k + 1 and repeat (l)-(3) until convergence. 



2.2 ECM algorithm 

The second algorithm to find the estimator £ = argmin£ 6M+ g(£) utilizes repre- 
sentation of the solution to (1) as a MAP estimator £ for the posterior distribution: 
p(£ I S) oc exp{— ^g(£)}. We propose to use the ECM algorithm for iterative compute 
its modes. The key to the ECM algorithm is the choice of latent variables. Note that, 
for any off-diagonal element cr^-, we have the following well-known representation: 

/oo 2 2 

(27TTy)~5 exp(-^-)y exp(-yr jj )dr ii , (9) 

where the latent scale parameter. This suggests that the posterior distribution p(E \ 
S) oc exp{— |g(£)} is the marginal distribution of the following complete posterior 
distribution of (£, r) where r = (t^)^-: 

2 



p(E,T I S) oc |S|-lexp{-^tr(SS)} J] exp(-^-) exp( 

x J]|exp(-^)}. (10) 



Consequently the covariance graphical lasso estimator from (1) is equivalent to the 
marginal mode for £ in (10). We now describe how to compute this marginal mode 
for £ using the ECM algorithm (Meng & Rubin, 1993). 
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To perform the E-step, we calculate the expected complete log-posterior by replac- 
ing r^ 1 with their conditional expectations E(r i ^ 1 | S,E^) given the current E^. 
Following the standard results of the inverse-Gaussian distribution, we have 

Ef— IS, P 



T, 



(*)| 



which leads to the following criterion function after removing the terms that do not 
depend on E, 



Q(E | E«) = J togp(S,r | S)p(r I EW,S)d 



T 

2 

>J_ 

(*)| 



oc -log(det E) - tr(SS- x ) - p ^ -^f- - p ^ o«. (11) 

The CM-step then maximizes (11) along each column (row) of E. Again, without 
loss of generality, we focus on the last column and row. Partition E and S as in (2) and 
consider the same transformation from (er 12 , cr 22 ) to (f3 = cr 12 , 7 = a 22 — cr / 12 S 11 1 cr 1 2). 
After dropping of terms not involving (/3, 7), the four terms in (11) can be written as 
functions of (/3, 7) 

log(det E) = log( 7 ) 

t^SE" 1 ) = /O'E^SuE^./St" 1 " 2s' 12 E n 1 /37~ 1 + W -1 , 



p^a ll = p(/3 / E n 1 /3 + 7). 

i 

where D = diag(|tr^|). Holding all but (/3,7) fixed, we can then rewrite (11) as 

Q(J3, 7 I E«) = -|log( 7 ) + /^SuEri/ST" 1 - 2s' 12 E n 1 /3 7 " 1 + ^227"' 

+p/3'D- 1 /3 + p^'E^ + p 7 j. (12) 

For 7, it is easy to derive from (12) that the conditional maximum point given (3 is 
the same as in (6). On the other hand, given 7, (12) can be written as a function of 

A 

Q(J3 I 7, S (fc) ) = - j/3'(V + pD" 1 ^ - 2u'/3|, 

where V and u are defined in (7). This implies that the conditional maximum point 
of f3 is 

(3 = (V + pD- 1 )- 1 ^ (13) 
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Cycling through every column generates a sequence of CM steps. The total number 
of CM-steps is 2p because there are p columns and for each column the CM involves 
two steps: one for (3 and the other for 7. The ECM algorithm can be summarized as 
follows: 

ECM algorithm Given input (S,p), start with £ (0) , at the (k + l)th iteration, k = 

0. 1..... 

1. Let S (fc+1) = S (fc) . 

2. E-step: Compute Q as in (11). 

3. CM-step: For i = 1, . . . , p, 

(a) Partition and S as in (2). 

(b) Compute 7 as in (6) and (3 as in (13). 

(c) Update <t^ +1) = /3, <rg +1) = /?', ag +1) = 7 + Z^ 1 /?. 

4. Let k — k + 1 and repeat (l)-(4) until convergence. 

2.3 Algorithm convergence 

The convergence of the proposed (block) coordinate descent algorithm can be ad- 
dressed by the theoretical innovation for block coordinate descent methods for non- 
differentiable minimization by Tseng (2001). The key to the application of the gen- 
eral theory there to our algorithm is the separability of the non-differentiable penalty 
terms in (1). First, from (6) and (8), the objective function g has a unique minimum 
point in each coordinate block. This satisfies the conditions of Part (c) of Theorem 
4.1 in Tseng (2001) and hence implies that the algorithm converges to a coordinate- 
wise minimum point. Second, because all directional derivatives exist, by Lemma 
3.1 of Tseng (2001), each coordinatewise minimum point is a stationary point. A 
close-related argument has been given by Breheny & Huang (2011) to show the con- 
vergence of coordinate decent algorithm for nonconvex penalized regression models. 

The convergence of the proposed ECM algorithm follows directly from the results 
of Meng & Rubin (1993). First, the analytical solutions (6) and (13) of CM-steps are 
unique for any |cx-^| ^ 0. Second, it can be easily seen that the construction of the 2p 
CM-steps satisfies the "space filling" condition for any £ because the CM-steps cycle 
through the whole parameter space. Thus, the two conditions of Theorem 3 of Meng 
& Rubin (1993) hold for the proposed ECM and so all limiting points of the ECM 
sequence {E^} are stationary points of (1). 

Using the latent variables r makes the EM-type optimization simpler than direct 
optimization, however, it has some issues. The ECM must be initialized at ^ 
for all % < j. Otherwise, from (11), the criterion function Q(S | = 00, the 
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algorithm will then stuck. Another related issue is that, although the sequence } 
will possibly converge to zero, they cannot be identically equal to zero. We want to 
point out that these issues are not unique to the covariance graphical models. Existing 
EM-type algorithms based on the latent scale parameters for fitting regularized linear 
models (e.g., Poison & Scott 2011; Armagan et al. 2011) have the same problems. 

3 Comparison of algorithms 

We compare the computational aspects of the two algorithms in Section 2.1 and 2.2 
with Bien & Tibshirani (2011)'s algorithm. We consider two scenarios: 

• A sparse model taken from Bien & Tibshirani (2011) with cr^+i = cr^-i = 
0.4, an = 5 and zero otherwise. Here, 5 is chosen so that the condition num- 
ber of E is p. 

• A dense model with an = 2 and a^ = 1 for i ^ j. 

The algorithm of Bien & Tibshirani (2011) is coded in R with its built-in functions. 
To be comparable to it, we implemented the ECM and the coordinate descent algo- 
rithms in R without writing any functions in a compiled language. All computations 
were done on a Intel Xeon X5680 3.33GHz processor. 

For either the sparse or the dense model, we first generated two datasets of dimen- 
sion (p, n) = (100, 200) and (p, n) = (200, 400), respectively. Then, for each of the four 
datasets, we applied the three algorithms for a range of p values. All computations 
were initialized at the sample covariance matrix, i.e., = S. For Bien & Tibshirani 
(2011)'s algorithm, we followed the default setting of tuning parameters provided 
by the "spcov" package (http : //cran.r-project . org/web/packages/ spcov/index . 
html). For the ECM and the coordinate descent algorithms, we used the same cri- 
terion as Bien & Tibshirani (2011)'s algorithm to stop the iterations: The procedure 
stops when the change of the objective function is less than 10 3 . 

First, we report the performance of numerical stability. In the case of sparse mod- 
els and p = 200, Bien & Tibshirani (2011)'s algorithm would not run. A careful 
inspection of the algorithm details reveals that the Newton-Raphson step there to 
find 5- fails in this case. The same problem occurs quite often when we apply Bien & 
Tibshirani (2011)'s algorithm to different datasets generated from different E. This 
suggests that Bien & Tibshirani (2011)'s algorithm may lack numerical stability. It 
may be possible to reduce this numerical problem by calibrating some of Bien & Tib- 
shirani (2011)'s tuning parameters such as the initial value for the Newton-Raphson 
method. In any event, it may be safe to conclude that Bien & Tibshirani (2011)'s 
algorithm requires either very careful tuning or lacks stability in certain cases. 

Now, we compare the computing speed. The four panels in Figure 1 display the 
CPU times of the three algorithms for each of the four datasets respectively. CPU 
time in seconds is plotted against the total number of off-diagonal non-zero elements 
estimated by the coordinate descent algorithm. The ECM and the coordinate descent 
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algorithms are much faster than the Bien & Tibshirani (2011)'s algorithm except for 
the dense model and p = 200 in which the ECM is comparable with Bien & Tibshi- 
rani (2011). Moreover, the coordinate descent algorithm seems to be particularly 
attractive for larger p as its run time generally decreases when the total number of 
estimated non-zero elements decreases. 

Next, we examine the ability of the algorithms to search for minimum points. To 
do so, we compared the minimum values of the objective functions achieved by each 
algorithm. For each dataset and each p, We calculated relative minimum values of 
objective functions defined as: 

g{Y>BT) ~ 9(Eecm), 9&cd) ~ 9&ecm), (14) 

where T, B t, Sod and T, E cm are the minimum points found by Bien & Tibshirani 
(2011), the coordinate descent and the ECM algorithms, respectively. Thus, a neg- 
ative value indicates that the algorithm finds better points than the ECM procedure 
and a smaller relative objective function indicates a better relative performance of 
the method. The four panels in Figure 2 display the relative minimum values of the 
objective functions for each of the four datasets, respectively. The relative minimum 
values are plotted as functions of the total number of non-zero elements estimated by 
the coordinate descent algorithm. As can be seen, the coordinate descent algorithm 
performs best as it always returns the lowest objective function among the three meth- 
ods. The ECM converges to points that have slightly higher objective function than 
Bien & Tibshirani (2011) in dense scenario (Panel (c) and (d)), however, the differ- 
ence is small (less than 0.05). Bien & Tibshirani (2011) seems to find points that are 
far less optimal than the coordinate descent or ECM algorithms in certain cases as is 
suggested by the spikes and dips in panel (a) and (b) where the differences in the 
objective function are larger than 1. 

Finally, it is known that, for nonconvex problems, any optimization algorithms 
are not guaranteed to converge to a global minimum. It is often recommended 
to run algorithms at multiple initial values. We then wish to compare the perfor- 
mance of the algorithms under different initial values. In the previous experiments, 
all computations were initialized at the full sample covariance matrix T,^ = S. 
To be different, it is natural to initialize them at the other extreme case in which 
= diag(sn, . . . , s pp ). However, this initial value can only be used for the coordi- 
nate descent but not for ECM as discussed before, since the ECM iteration will stuck 
at S (0) and never move when the off-diagonal elements are exactly equal to zero. To 
avoid this problem, we start the ECM at diag(sn, . . . , s pp ) + 10 -3 to avoid exact zeros 
on the off-diagonal elements. For each of the four datasets, we picked three different 
values of p so that they represent three different levels of sparsity based on the exper- 
iment before. For each of the four datasets and each of the three ps, we ran the three 
algorithms starting from the new set of initial values: 

— ^bt = diag(sn, . . . , s P p), ^ecm = diag(sn, . . . , s pp ) + 10 3 . 



8 



(a) (b) 




)' 1 1 1 1 1 o 1 1 1 1 

1000 2000 3000 4000 5000 0.5 1 1.5 2 

Total number of estimated non-zero entries Total number of estimated non-zero entries y < n 4 



Figure 1: CPU time (in seconds) is plotted against the number of non-zero edges 
estimated for four different models and three different algorithms. The four models 
are: sparse £ and p = 100 (Panel a); sparse £ and p = 200 (Panel b); dense £ and 
p = 100 (Panel c), and dense S and p = 200 (Panel d). The three algorithms are: 
Bien & Tibshirani (2011) (BT, solid line), ECM (dotted line) and coordinate descent 
(CD, dashed line). Each computation is initialized at the sample covariance matrix 
E<°> = S. 
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Figure 2: Relative minimum value of objective function, denned in (14), is plotted 
against the number of estimated non-zero elements for four different models and 
three different algorithms. The four models are: sparse £ and p = 100 (Panel a); 
sparse S and p = 200 (Panel b); dense S and p = 100 (Panel c), and dense S 
and p = 200 (Panel d). The three algorithms are: Bien & Tibshirani (2011) (BT, solid 
line), ECM (dotted line) and coordinate descent (CD, dashed line). Each computation 
is initialized at the sample covariance matrix = S. 
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We recorded the CPU time, sparsity of the minimum points and the minimum value 
of the objective function. Table 1 reports these measures. Three things are worth 
noting. First, Bien & Tibshirani (2011)'s algorithm seems to get stuck at the initial 
value of a diagonal matrix in all cases. In contrast, the proposed algorithms work 
fine and find reasonable minimum points of £ because the minimum values of the 
objective function and level of sparsity are quite close to those obtained from starting 
at the full sample covariance matrix. We have tried to initialize Bien & Tibshirani 
(2011)'s algorithm at = diag(sn, . . . , s pp ) + 1CT 3 but found that it still gets stuck 
after a few iterations. It may be possible to improve the poor performance of Bien & 
Tibshirani (2011)'s algorithm by adjusting some of its tuning parameters. However, 
it may be safe to conclude that Bien & Tibshirani (2011)'s algorithm requires either 
very careful tuning or performs badly at this important initial value. Second, initial 
values indeed matter. Comparing the results between full and diagonal initial values, 
we see substantial differences in all three measures. In particular, the limiting points 
from the diagonal initial matrices are sparser than those from the full initial matri- 
ces. This is not surprising because of the drastic difference in sparsity between these 
two starting points. Third, comparing the minimum values of the objective function 
achieved by the three algorithms (last two columns), we see that the coordinate de- 
scent often reaches the lowest minimum values regardless of the initial points. The 
few exceptions (e.g., sparse S, p — 100, p = 0.01) when the coordinate descent is 
not the best seems to be the case that p is small and the fraction of the number of 
non-zero elements is large. 

4 Discussion 

We have developed two alternative algorithms for fitting sparse covariance graphical 
lasso models using L 1 penalty. These two algorithms are shown to be much easier to 
implement, significantly faster to run and numerically more stable than the algorithm 
of Bien & Tibshirani (2011). Both MATLAB and R software packages implementing 
the new algorithms for solving covariance graphical models are freely available from 
the author's the website of the paper. 
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Table 1: Performance of three algorithms starting at two different initial values: Full, 
E (0) = S; and Diag, S (0) = diag(a u , . . . , s pp ). The "CPU Time" columns present the 
CPU run time in seconds; the "% Nonzero" columns present the percentage of nonzero 
elements in the minimum points; the "Objective Func" columns present the minimum 
value of the objective function. The three algorithms are: Bien & Tibshirani (2011) 
(BT), the Expectation/Conditional Maximization algorithm (ECM) of Section 2.2 and 
the coordinate descent (CD) of Section 2.1. 
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